TMDB 5000 Movie Dataset (Movie Recommendation System)

Types of Recommendation Systems :

1. Content Based Recommendation System :

This system recommends movies to a user based on the movies they have watched before. For example, if a person has watched action movies, it will recommend action movies to them.

2. Popularity Based Recommendation System :

This system recommends the top movies on film platforms such as Netflix or in cinemas.

3. Collaborative Recommendation System :

This system groups people based on their watching patterns. If a user watches one of a group's films, the system recommends the other films watched by that group to the user (that is, it recommends based on other users' previous data).

🚨 My note: We will take input from the user, so we can use the second and third systems (techniques).

Workflow :

1. Data Collection :

We need data about the movies (movie description, type of the movie, etc.).

2. Data PreProcessing :

Clean the data of any missing or incomplete values.

3. Feature Extraction :

There are textual features in the data frame that we cannot use directly, so we need to convert them into meaningful numerical values.

4. Find the Similarity :

We have 5000 movies and we want to find which movies are similar to each other by giving them a similarity score (similarity confidence score).

5. User Input :

Ask the user for input; based on that input, we suggest which movies the user can watch.

6. Use Cosine Similarity :

Cosine similarity is used to measure the similarity between vectors. We convert each movie into a vector and compute the cosine similarity between these vectors. When a user gives a movie name, we compare that movie with all the others to find which ones are most similar to it. This gives us a list of similar movies.
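As a minimal sketch of the idea (not the notebook's own code), cosine similarity between two vectors can be computed in plain Python. The function name `cosine_similarity` and the toy vectors below are made up for illustration; returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

# Toy word-count vectors (made-up numbers, not taken from the dataset)
movie_a = [1, 1, 0, 2]
movie_b = [1, 0, 0, 2]
movie_c = [0, 0, 3, 0]

print(cosine_similarity(movie_a, movie_b))  # close to 1: the vectors point in a similar direction
print(cosine_similarity(movie_a, movie_c))  # 0.0: the vectors share no non-zero position
```

For non-negative count vectors like these, the score falls between 0 and 1, which is why it can be read as a similarity score.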

cosine.png

Import Libraries :

Describe the data :

🚨 Note: From these summary statistics, we can see that there may be some wrong values and outliers.

How to deal with outliers?

🚨 Budget column: 1. There are films with a budget of less than 100, which is implausibly small. 2. There is a high percentage of outliers in this column (it would not be good to drop them), so I will replace them with the median. 3. Fortunately, this column is not important in the recommendation process, so I will exclude it later.
🚨 Revenue column: 1. There are films with a revenue of less than 100, which is implausibly small. 2. There are some outliers in this column (it would not be good to drop them), so I will replace them with the median. 3. Fortunately, this column is not important in the recommendation process, so I will exclude it later.
🚨 Runtime: 1. There are some rows with a runtime (film duration) of zero, which is a wrong value. 2. This column has outliers (not many). 3. This column will be important in the recommendation process, so I drop the outlier rows.
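The notes above do not say how outliers are detected, so the following is only a sketch under an assumption: it uses the common 1.5 × IQR rule to flag outliers and then replaces them with the median, as described for the budget and revenue columns. The function name and the sample values are hypothetical.

```python
import statistics

def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.

    The IQR rule is an assumption; the notebook may detect outliers differently.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]

budgets = [20, 25, 30, 35, 40, 45, 50, 5000]  # made-up values
print(replace_outliers_with_median(budgets))
# [20, 25, 30, 35, 40, 45, 50, 37.5]
```

Dropping outlier rows instead (the choice made for runtime) would be the same bounds check used as a filter rather than a replacement.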

Null values:

🚨 My note:

Homepage :
- It has 3091 null values.
- How I deal with it : I delete this column. It is not useful for the recommendation.

Overview :
- It has 3 null values.
- How I deal with it : I drop these 3 rows.

Release date :
- It has 1 null value.
- How I deal with it : I drop this row.

Runtime :
- It has 2 null values.
- How I deal with it : I drop these 2 rows.

Tagline :
- It has 844 null values.
- How I deal with it : I replace the null values with an empty string, because the number of affected rows is too large to drop. Also, this feature will be helpful in the recommendation process.
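A sketch of how these three strategies might look in pandas, using a tiny made-up frame rather than the real dataset. The column names follow the note above; the data frame name `movies` and all values are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up frame with the column names mentioned in the note
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [None, "https://example.com", None, None],
    "overview": ["text", None, "text", "text"],
    "release_date": ["2009-12-10", "2010-01-01", None, "2012-07-16"],
    "runtime": [162.0, 100.0, 95.0, 120.0],
    "tagline": ["Enter the world", None, None, None],
})

# Column with mostly null values that is not needed: drop the whole column
movies = movies.drop(columns=["homepage"])
# Columns with only a few nulls: drop the affected rows
movies = movies.dropna(subset=["overview", "release_date", "runtime"])
# Column with many nulls that is still useful: fill with an empty string
movies["tagline"] = movies["tagline"].fillna("")
movies = movies.reset_index(drop=True)

print(movies)
```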

Drop some columns and rows with null value :

🚨 My note: The number of columns and rows is barely affected (38 rows deleted, 3 columns deleted).

Correlation Matrix :

🚨 My note: 1. Vote count has a strong correlation with popularity. This is logical: the more votes a film has, the more popular it is. 2. Vote count, vote average, and popularity have a low correlation with runtime. That means the length of a film does not affect people's votes or its popularity. 3. Revenue has a very low correlation with runtime, so the length of a film has little relation to its revenue.

Convert release_date to year to make the column easier to work with :
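A minimal sketch of this step, assuming the dates are stored as `YYYY-MM-DD` strings. The helper name `release_year` is hypothetical.

```python
from datetime import datetime

def release_year(release_date):
    """Return the year of a 'YYYY-MM-DD' date string as an int."""
    return datetime.strptime(release_date, "%Y-%m-%d").year

print(release_year("2009-12-10"))  # 2009
```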

Make a new column for runtime types :

0 => duration <= 40       => Short Movies
1 => 40 < duration <= 70  => Medium Duration Movies
2 => duration > 70        => Long Duration Movies
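The mapping above can be sketched as a small function. The name `runtime_type` is an assumption; the thresholds are the ones listed.

```python
def runtime_type(duration):
    """Map a runtime in minutes to 0 (short), 1 (medium) or 2 (long)."""
    if duration <= 40:
        return 0  # short movie
    if duration <= 70:
        return 1  # medium duration movie
    return 2      # long duration movie

print([runtime_type(d) for d in (25, 40, 41, 70, 71, 162)])
# [0, 0, 1, 1, 2, 2]
```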

Genre extraction function (from the raw data, for the creation of tags) :
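In the raw data, the genres column holds a JSON string containing a list of objects with `id` and `name` keys. A sketch of one way to pull out just the names follows; the helper name `extract_names` is hypothetical and this is not necessarily how the notebook does it.

```python
import json

def extract_names(raw):
    """Return the 'name' of every object in a JSON-encoded list."""
    return [item["name"] for item in json.loads(raw)]

raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names(raw_genres))  # ['Action', 'Adventure']
```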

Function for extracting the top (first) 8 actors of the movie :
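A sketch of this step, assuming the cast column is a JSON-encoded list of objects with a `name` key, already in billing order. The helper name `top_cast` and the sample cast are made up.

```python
import json

def top_cast(raw, limit=8):
    """Return the names of the first `limit` cast members."""
    return [member["name"] for member in json.loads(raw)[:limit]]

# Made-up cast of 12 people, standing in for one cell of the cast column
raw_cast = json.dumps([{"name": f"Actor {i}", "order": i} for i in range(12)])
print(top_cast(raw_cast))
```

Slicing before extracting the names keeps the function cheap even for very long cast lists.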

Function to fetch the director of the movie from the crew column :
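A sketch of this lookup, assuming each crew entry is an object with `job` and `name` keys inside a JSON-encoded list. The helper name `fetch_director` and the sample crew are hypothetical; returning `None` when no director is listed is a choice made here.

```python
import json

def fetch_director(raw):
    """Return the name of the first crew member whose job is 'Director', or None."""
    for member in json.loads(raw):
        if member.get("job") == "Director":
            return member["name"]
    return None

raw_crew = '[{"job": "Editor", "name": "Person A"}, {"job": "Director", "name": "Person B"}]'
print(fetch_director(raw_crew))  # Person B
```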

Remove spaces between words :

Lowercase all the letters in the tags column :
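Removing spaces turns a multi-word name into a single token, so that, for example, two people who share a first name are not treated as related. Lowercasing then makes the tokens case-insensitive. A sketch of both steps, with a hypothetical helper name `normalize`:

```python
def normalize(names):
    """Remove spaces inside each name and lowercase it."""
    return [name.replace(" ", "").lower() for name in names]

print(normalize(["Science Fiction", "Sam Worthington"]))
# ['sciencefiction', 'samworthington']
```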

Apply stemming to merge similar/duplicate words in the word list :

Convert text to matrix :
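The idea behind this step is a bag-of-words count matrix: one row per movie, one column per word in the vocabulary. The notebook most likely uses a library vectorizer for this; the plain-Python sketch below only illustrates the concept, and the function name `count_matrix` and the sample tags are made up.

```python
from collections import Counter

def count_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in documents[i]."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

tags = ["action adventur space", "action crime", "space space alien"]
vocab, matrix = count_matrix(tags)
print(vocab)   # ['action', 'adventur', 'alien', 'crime', 'space']
print(matrix)  # [[1, 1, 0, 0, 1], [1, 0, 0, 1, 0], [0, 0, 1, 0, 2]]
```

Each row of the matrix is the vector that the similarity step compares.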

----------------------------------------------------------------------------------------------------------

Visualization :

Recommendation

How does the recommender function work ?
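In outline, a recommender of this kind looks up the row of the similarity matrix that belongs to the chosen movie, sorts the other movies by score, and returns the best matches. The following is a sketch under that assumption, not the notebook's actual function; `recommend`, the titles, and the similarity values are all made up.

```python
def recommend(title, titles, similarity, top_n=5):
    """Return the top_n (title, score) pairs most similar to `title`."""
    index = titles.index(title)  # raises ValueError if the title is unknown
    scores = [
        (titles[i], score)
        for i, score in enumerate(similarity[index])
        if i != index            # skip the movie itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [               # made-up, symmetric similarity scores
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]
print(recommend("Movie A", titles, similarity, top_n=2))
# [('Movie B', 0.8), ('Movie D', 0.4)]
```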

Bar plot titles and similarity scores :

Machine Learning Model

In this section we will further clean the data and then create a model that predicts the rating of a movie.

Cleaning the data

Importing needed packages

Importing the csv files

Merging the two files together

Dropping columns with low correlation with vote_average

We referred to the correlation matrix at the start of the section. We dropped budget and revenue even though they seem to be important and good predictors of a movie's success, because the correlation matrix showed a very weak relationship between either of them and vote_average.

Change the json files to lists

  1. Convert each list of dictionaries into a list of the values we need
  2. Pinpoint the rows that we will need to drop because they are empty
  3. Update the rows of our dataframe, to maintain the integrity of the rows' numbering

Generate binary lists 🔢
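A binary list marks, for every entry of a fixed vocabulary (for example all genres), whether the movie has it (1) or not (0). A minimal sketch, with a hypothetical helper name `binary_list` and a made-up, shortened genre vocabulary:

```python
def binary_list(items, vocabulary):
    """Return a 0/1 list with one position per vocabulary entry."""
    present = set(items)
    return [1 if entry in present else 0 for entry in vocabulary]

all_genres = ["Action", "Comedy", "Drama", "Thriller"]  # shortened, made-up vocabulary
print(binary_list(["Drama", "Thriller"], all_genres))  # [0, 0, 1, 1]
```

Because every movie is encoded against the same vocabulary, the resulting lists have equal length and can be compared directly.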

Prediction Model🔢

Similarity Function 👀

Score Predictor 🔮

  1. Input the movie whose score you want to predict
  2. Find its similarity to all other movies
  3. Find the top 10 similar movies
  4. Find the average rating of those movies
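The steps above can be sketched as follows. The similarity measure is passed in as a parameter because this document does not spell out the one it uses; the Jaccard overlap on genre sets below is only a stand-in, and the function names and movie data are made up.

```python
def jaccard(a, b):
    """Stand-in similarity: shared genres divided by all genres of both movies."""
    union = a["genres"] | b["genres"]
    return len(a["genres"] & b["genres"]) / len(union) if union else 0.0

def predict_score(target, movies, similarity, k=10):
    """Average the vote_average of the k movies most similar to `target`."""
    others = [m for m in movies if m is not target]
    others.sort(key=lambda m: similarity(target, m), reverse=True)
    top = others[:k]
    if not top:
        return None  # nothing to average
    return sum(m["vote_average"] for m in top) / len(top)

movies = [  # made-up data
    {"title": "A", "genres": {"Action", "Adventure"}, "vote_average": 7.0},
    {"title": "B", "genres": {"Action"}, "vote_average": 6.0},
    {"title": "C", "genres": {"Drama"}, "vote_average": 8.0},
    {"title": "D", "genres": {"Action", "Adventure", "Fantasy"}, "vote_average": 5.0},
]
target = {"title": "X", "genres": {"Action", "Adventure"}}

print(predict_score(target, movies, jaccard, k=2))  # 6.0 (average of A and D)
```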

Our Accuracy 📉

Answering Questions

What are the movie genres available?

Create a counter for the frequency table

Create a frequency table
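A sketch of the counter and frequency table using `collections.Counter`. The genre lists are made up, and the variable names are assumptions.

```python
from collections import Counter

# Made-up genre lists, one per movie
genre_lists = [["Drama", "Comedy"], ["Drama", "Thriller"], ["Drama"], ["Comedy"]]

# Count how often each genre appears across all movies
counter = Counter(genre for genres in genre_lists for genre in genres)

# Frequency table: (genre, count) pairs, most frequent first
frequency_table = counter.most_common()
print(frequency_table)  # [('Drama', 3), ('Comedy', 2), ('Thriller', 1)]
```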

Create the plot

As we can see from the visualization, the three most frequent genres are Drama, Comedy, and Thriller.

What are the top 10 rated movies and what do they have in common?

As we can see from the visualizations, Minions is the most popular movie.

What is the number of movies per year?

Answer: 2009 was the top year, with 241 movies.

The top 10 actors who contributed to the most successful movies

The actor who had the most successful movies is "Johnny Depp"

Which movie has the highest budget?

"Pirates of the Caribbean: On Stranger Tides" had the highest budget

Which movie is the longest?

"The story of Venezuelan revolutionary" is the longest movie

In which month were most movies released from 1921 to 2017?